{⊂C;<N;αVISION THEORY.;λ30;P66;I425,0;JCFA} SECTION 6.
{JCFD}                   COMPUTER VISION THEORY.
{λ10;W250;JAFA}
	6.0	Introduction to Computer Vision Theory.
	6.1	A Geometric Feedback Vision System.
	6.2	Vision Tasks.
	6.3	Vision System Design Arguments.
	6.4	Mobile Robot Vision.
	6.5	Summary and Related Vision Work.

{λ30;W0;I900,0;JUFA}
⊂6.0	Introduction to Computer Vision Theory.⊃

	Computer vision concerns programming a computer  to do a task
that  demands the  use of  an image  forming light  sensor such  as a
television camera.  The theory I intend to elaborate is  that general
3-D  vision is  a continuous  process of  keeping an  internal visual
simulator  in sync with  perceived images of the  external reality so
that vision tasks  can be done more  by reference to the  simulator's
model  and  less  by  reference  to  the original  images.  The  word
<theory>, as used here,  means simply a set of statements  presenting
a systematic view of  a subject; specifically, I wish  to exclude the
connotation that  the theory is a natural  theory of vision. Perhaps
there can be  such a thing  as an  <artificial theory> which  extends
from the philosophy thru the design of an artifact.

⊂6.1	A Geometric Feedback Vision System.⊃

	Vision systems mediate between images and world models; these
two extremes of a vision system are called, in the jargon, the
<bottom> and  the <top> respectively.   In  what follows,   the  word
<image> will be used  to refer to the notion of  a 2-D data structure
representing  a picture; a  picture being a rectangle  taken from the
pattern  of  light  formed  by  a  thin  lens  on   the  nearly  flat
photoelectric surface of a  television camera's vidicon. On the other
hand, a  <world model>  is  a data  structure  which is  supposed  to
represent the physical world for the purposes of a task processor. In
particular,  the  main  point of  this  thesis  concerns isolating  a
portion of the world model (called the 3-D geometric world model) and
placing it below most of the other entities that a task processor has
to deal with.  The vision hierarchy, so formed, is illustrated in Box 6.1.
{|λ10;JA}
BOX 6.1 {JC} VISION SYSTEM HIERARCHY.

{JC} Task Processor
{JC} |
{JC} Task World Model
		 The  Top  → {JC} |
{JC} 3-D Geometric Model
{JC} |
		 The Bottom → {JC} 2-D Images
{|λ30;JU}
	Between the top and  the bottom, between images and  the task
world model,  a general vision system has three distinguishable modes
of operation: recognition,  verification and description. Recognition
vision can be characterized as bottom up: what is in the picture is
determined by extracting a set of features from the image and by
classifying them with respect to prejudices which must be taught.
Verification vision is top down or model driven vision, and involves
predicting an image followed by comparing the predicted image and a
perceived image for differences which are expected but not yet
measured.  Descriptive vision is bottom up or data driven vision and
involves converting  the image into  a representation  that makes  it
possible (or easier) to do the desired  vision task.  I would like to
call  this  third  kind  of  vision  "revelation  vision"  at  times,
although the  phrase "descriptive vision"  is the  term used by  most
members of the computer vision community.
{|λ10;JU;FA}
Box 6.2 {JC} THREE BASIC MODES OF VISION.

	1. Recognition Vision - Feature Classification. (bottom up into a prejudiced top).
	2. Verification Vision - Model Driven Vision. (nearly pure top down vision).
	3. Descriptive Vision - Data Driven Vision. (nearly pure bottom up vision).
{|λ30;JU}
	There are now enough concepts to outline a feedback system.
By placing a 3-D geometric model between top and bottom, recognition vision can
be done by mapping 3-D (rather than 2-D) features into the task world model;
and descriptive  vision and verification  vision could be used  to link
the  2-D and  3-D models  in a  relatively dumb,  mechanical fashion.
Previous attempts to use recognition vision,  to  bridge directly the
gap between 2-D  images (of  3-D objects)  and the  task world
model, have  been frustrated  because  the characteristic  2-D  image
features of  a  3-D object  are very  dependent on  the 3-D  physical
processes  of occultation,  rotation  and illumination.   It is these
processes that  will have  to be  modeled and  understood before  the
features  relevant to  the task  processor  can be  deduced from  the
perceived  images.  The arrangement  of  these elements  is diagramed
below.{|λ10;JA}
Box 6.3 {JC} BASIC FEEDBACK VISION SYSTEM DESIGN.

{JC} Task World Model
{JC} ↑
{JC} RECOGNITION
{JC} ↑
{JC} 3-D geometric model
{JC} ↑            ↓
{JC} DESCRIPTION        VERIFICATION
{JC} ↑            ↓
{JC} 2-D images
{|λ30;JU}
	The  lower part  of the  above
diagram is  the feedback  loop of  the  3-D geometric
vision system. Depending on circumstances,  the vision  system may
run almost  entirely top-down  (verification vision)  or
bottom-up  (revelation vision).  Verification vision  is all  that is
required in a well known, predictable environment; whereas revelation
vision is required in a brand new (tabula rasa) or rapidly changing
environment.  Thus revelation and verification form a loop, bottom-up
and top-down. First, there is revelation that unprejudicially builds
a 3-D model; and second, the model is verified by testing image
features predicted from the assumed model.  This loop-like structure
has been noted before by others; it is a form of what [Tenenbaum]
called <accommodation> and it is a form of what [Falk] called
<heuristic vision>; however, I will go along with what I think is the
current majority of vision workers who call it <feedback vision>.
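
	As a concrete illustration, the following sketch (written in
present-day Python rather than in the notation of this thesis, and
with hypothetical routine names) shows one way the verification and
revelation halves of the loop could alternate around a shared 3-D
geometric model.

# A minimal sketch of the feedback cycle: verification predicts an image
# from the model and compares it to a perceived image; where the comparison
# fails, revelation (description) revises the 3-D model from the image data.
# The callables passed in stand for the processors named in the next section.
def feedback_vision_cycle(model, camera, perceive, predict, compare, describe):
    perceived = perceive()                      # take a television picture
    predicted = predict(model, camera)          # synthesize the expected image
    mismatches = compare(predicted, perceived)
    if mismatches:                              # model and reality out of sync:
        describe(model, perceived, mismatches)  # revise the 3-D model bottom up
    return model

In a well known, predictable environment the describe step would
rarely be invoked; in a tabula rasa environment it would dominate.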

	Completing  the   design,     the  images   and  worlds   are
constructed, manipulated and compared by a variety of processors, the
topmost of which is the task processor. Since the task processor is
expected to vary with the application, it would be expedient if it
could be isolated as a user program that calls on utility routines
of an appropriate vision sub-system.  Immediately below the task
processor are the 3-D recognition routines and the 3-D modeling
routines. The modeling routines underlie almost everything because
they are used to create, alter and access the models.{
|;λ10;JAFA}
Box 6.4	{JC} PROCESSORS OF A 3-D VISION SYSTEM.
{↓}	
	0. The task processor.
	1. 3-D recognition.
	2. 3-D modeling  routines.
	3. Reality simulator.
{↑;W560;}
	4. Image analyser.
	5. Image synthesizer.
	6. Locus solvers.
	7. Comparators: 2D and 3D.
{|;λ30;JUFA}
	The remaining processors include the  reality simulator which
does mechanics for modeling motion, collision and gravity.
Also there  are image  analyzers,   which  do image  enhancement  and
conversions  such as  converting  video rasters  into line  drawings.
There  is an  image synthesizer, which  does hidden  line and surface
elimination, for verification by comparing synthetic  images from the
model  with perceived  images of  reality. There  are three  kinds of
locus solvers that compute numerical descriptions for cameras,  light
sources and physical objects.  Finally, there is of course a large
number (at least ten) of different compare processors for confirming or
denying correspondences among entities in each of the different kinds
of images and 3-D models.
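
	The central geometric step shared by the image synthesizer and
the locus solvers is perspective projection of a model point into
image coordinates. A minimal sketch follows, assuming a simple pinhole
camera with rotation matrix R, position C, and focal length f; the
names are illustrative and do not refer to the thesis routines.

import numpy as np

# Hedged sketch of pinhole perspective projection: transform a world point
# into the camera frame and divide by depth.  The image synthesizer repeats
# this step (plus hidden line and surface elimination) for every model
# vertex; a camera locus solver works backwards from matched features.
def project(world_point, R, C, f):
    x, y, z = np.asarray(R) @ (np.asarray(world_point) - np.asarray(C))
    return (f * x / z, f * y / z)   # image-plane coordinates, z along the optical axis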

⊂6.2	Vision Tasks.⊃

	The 3-D  vision research problem  being discussed is  that of
finding  out how to  write programs that  can see in  the real world.
Related vision problems  include: modeling  human
perception,  solving visual  puzzles (non-real world), and developing
advanced automation  techniques (ad hoc vision). In order to approach
the problem,  specific  programming tasks are proposed  and solutions
are sought. However, please distinguish the idea of a research problem
from that of a programming task; as will be illustrated, many  vision
tasks can be  done without vision.   The vision solution to  be found
should  be  able  to  deal with  real  images,    should include  the
continuity of the  visual process in  time and  space, and should  be
more general  purpose and less ad hoc.    These  three  requirements
(reality,   continuity, and generality) will be  developed by surveying
six examples of computer vision tasks.{Q}
{|;λ10;JAFA}
BOX 6.5{JC}	SIX EXAMPLES OF COMPUTER VISION TASKS.
{↓}
<Cart Related Tasks>.
	1. The Chauffeur Task.
	2. The Explorer Task.
	3. The Soldier Task.
{↑;W650;}
<Table Top Related Tasks>.
	4. Turntable Task.
	5. The Blocks Task.
	6. Machine Assembly Tasks.
{|;λ30;JUFA}
	First, there is the robot chauffeur task.  In 1969, John
McCarthy asked  me to consider the vision  requirements of a computer
controlled car such as he depicted in an unpublished essay.  The idea
is that a user of such  an automatic car would request a destination;
the  robot would select a  route from an  internally stored road map;
and it would then proceed to its destination using  visual data.  The
problem  involves  representing the  road  map  in  the computer  and
establishing the correspondence between the map and the appearance of
the road as the automatic chauffeur drives the vehicle along the
selected route.   Lacking a computer controlled car,  the problem was
abstracted to that of tracing a route along the driveways and parking
lots that  surround the Stanford  A.I. Laboratory using  a television
camera  and transmitter mounted on a  radio controlled electric cart.
The robot chauffeur task could be solved by non-visual means such as
by railroad-like guidance or by inertial guidance; to preserve the
vision aspect of the problem, no particular artifacts should be
required along a route (landmarks must be found, not placed); and the
extent of inertial dead reckoning should be noted.

	Second,  there is the task of a robot explorer.  In [McCarthy
1964] there is a description of a robot for exploring Mars. The robot
explorer was required  to run for long periods of  time without human
intervention because the signal transmission time to Mars is as great
as twenty minutes and because the 24.6 hour Martian day would place
the vehicle out of Earth's sight for twelve hours at a time.  (This
latter difficulty could be avoided at the expense of having a set  of
communication relay satellites in orbit around Mars). The task of the
explorer  would be to drive  around mapping the  surface, looking for
interesting features,  and doing various experiments.  To be prudent,
a Mars explorer  should be able to navigate without  vision; this can
be  done  by driving  slowly  and by  using a  tactile  collision and
crevasse detector.  If the television system fails,  the core samples
and so on  can still be collected at  different Martian sites without
unusual risk to the vehicle due to visual blindness.

	The third vision  task is that  of the robot soldier,   tank,
sentry, pilot or  policeman.  The problem has several forms which are
quite similar to the chauffeur  and the explorer with the  additional
goal of doing something to coerce  an opponent.  Although this vision
task has  not yet been explicitly attempted at Stanford,  to the best
of my knowledge, the reader should be warned that a thorough solution
to any of the other  tasks almost assures the Orwellian technology to
solve this one.

	Fourth, the turntable task is to construct a 3-D model  from
a sequence of 2-D  television images taken of an object  rotated on a
turntable.   The turntable task was  selected as a simplification of
the explorer task and is an example of a nearly pure descriptive
vision task.

	Fifth, the classic blocks vision  task consists of two parts:
first  convert a  video image into  a line  drawing; second,   make a
selection from a  set of predefined  prototype models of blocks  that
accounts for the line drawing.  In my opinion, this vision task
emphasizes three pitfalls: single image vision, line drawings and
blocks. The greatest pitfall, in the usual blocks vision task, is the
presumption  that a  single  image is  to be  solved;  thus diverting
attention  away  from  the   two  most  important  depth   perception
mechanisms which are motion parallax  and stereo parallax. The second
pitfall is that the usual notion of a perspective line drawing is not
a natural intermediate state, but is rather a very sophisticated and
platonic geometric  idea. The perfect line  drawing lacks photometric
information; even a line drawing  with perfect shadow lines  included
will not resemble anything  that can readily be gotten  by processing
real television pictures.  Curiously, the lack of success in deriving
line drawings  from real  television images  of real  blocks has  not
dampened  interest in  solving the  second part  of the  problem. The
perfect line drawing puzzle was first worked on by [Guzman] and
extended to perfect shadows by [Waltz]; nevertheless, enough remains so
that  the puzzle  will  persist on  its  own merits,   without  being
closely relevant to real world  computer vision.  Even assuming  that
imperfect line drawings are given, the blocks themselves have
led such researchers as [Falk] and [Grape] to concentrate on vertex/edge
classification schemes which have not been extended beyond the blocks
domain. The blocks task could be rehabilitated by concentrating on
photometric modeling and the use of multiple images for depth
perception.
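
	For reference, the stereo parallax mechanism mentioned above
reduces to a single relation: with two parallel cameras of focal
length f separated by a baseline b, the depth of a matched feature is
inversely proportional to its disparity between the two images. A
small sketch follows; the units and names are illustrative, not thesis
code.

# Standard stereo-parallax relation, assuming two parallel cameras with
# baseline b and focal length f (consistent units): depth = f * b / disparity.
def depth_from_stereo(x_left, x_right, f, b):
    disparity = x_left - x_right     # image-plane shift of the matched feature
    return f * b / disparity         # range along the optical axis

Motion parallax obeys essentially the same relation, with the camera
displacement between successive frames playing the role of the
baseline.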

	Sixth, the Stanford Artificial Intelligence Laboratory has
recently  (1974) begun  work on a  National Science  Foundation Grant
supporting research in  automatic machine  assembly.  In  particular,
effort  will  be  directed  to  developing  techniques  that  can  be
demonstrated by automatically assembling a chain saw gasoline
engine. Two  vision questions  in such  a machine  assembly task  are
where  is the part  and where  is the hole;  these questions  will be
initially handled by  composing ad  hoc part and  hole detectors  for
each vision step required for the assembly.

	The point of this  task survey was to delimit what  is and is
not a  task requiring real 3-D vision; and  to point out that caution
has to be taken  to preserve the vision aspects  of a given task.  In
the usual course  of vision projects, a single task  or a single tool
unfortunately dominates the research; my work is no exception: the
one tool is 3-D modeling, and the task that dominated the formative
stages of the research is that of the robot chauffeured cart.  A
better understanding of the ultimate nature of computer vision can be
obtained by keeping the several tasks and the several tools in mind.

⊂6.3	Vision System Design Arguments.⊃

	The physical information most directly  relevant to vision is
the location,  extent and light scattering properties of solid opaque
objects; the location,   orientation  and projection of  the camera  that
takes the  pictures; and the  location and  nature of the  light that
illuminates  the world.    The transformation  rules of  the everyday
world that  a  programmer  may assume,  a  priori,  are the  laws  of
physics.  The arguments against geometric modeling divide
into two categories: the reasonable and the intuitive.
The reasonable arguments attack 3-D geometric modeling by
comparing it to another modeling alternative (some alternatives are
listed in the box immediately below).  Actually, the domains
of  efficiency of  the  possible  kinds  of models  do  not
greatly overlap;  and an artificial intellect  will have some portion of
each  kind.  Nevertheless, I  feel  that  3-D  geometric  modeling  is
superior for  the task at  hand, and that  the other models  are less
relevant to vision.{Q}
{|;λ10;JAFA}
BOX 6.6{JVJC} Alternatives to 3-D Geometric Modeling in a Vision System.{I∂20,0;}
		1. Image memory with only the camera model in 3-D.
		2. Statistical world model, e.g. Duda & Hart.
		3. Procedural Knowledge, e.g. Hewitt & Winograd.
		4. Semantic knowledge, e.g. Wilks & Schank.
		5. Formal Logic models, e.g. McCarthy & Hayes.
		6. Syntactic models.
{|;λ30;JUFA}

	Perhaps the best alternative  to a 3-D geometric model  is to
have  a library  of little  2-D images  describing the  appearance of
various 3-D loci from given directions.  The advantage would  be that
a sophisticated image  predictor would not be required;  on the other
hand, the image library is potentially quite large, and even with
a huge data base new views and lighting of familiar objects and
scenes can not be anticipated. The statistical model is quite
relevant to vision and can be added to the geometric model.  However,
the statistical model  can not stand  alone because the  processes of
occultation,  rotation and illumination make the approach infeasible.

	Procedural knowledge models  represent the world in  terms of
routines (or actors) which either know or can compute the answer to a
question about the world. Semantic models represent the world in terms
of a data structure of conceptual statements; and formal logic models
represent the  world in terms of first order predicate calculus or in
terms of a  situation calculus. The  procedural, semantic and  formal
logic world models are all general enough to represent a
vision model, and in a theoretical sense they are merely other notations
for 3-D geometric modeling.  However, in practice, these three
modeling   regimes  are  not   efficient  holders   and  handlers  of
quantitative geometric  data; but are  rather intended  for a  higher
level  of  abstract reasoning.  Another  alleged  advantage of  these
higher  models  is  that they  can  represent  partial  knowledge and
uncertainty,  which  in  a  geometric  model  is   implicit, in  that
structures are missing or incomplete.  For example, McCarthy and
Feldman demand that when a robot has only seen the front of an office
desk, the model should be able to draw inferences about the back
of the desk; I feel that this so-called advantage is not required by
the problem  and that basic  visual modeling  is on  a more  agnostic
level.

	The syntactical  approach to  descriptive vision  is that  an
image is a sentence of a picture grammar and that consequently the
image description should be given in terms of the sequence of grammar
transformation rules. Again, this paradigm is valid in principle but
impractical for real images of 3-D objects because simple
replacement rules can not readily express rotation, perspective,
and photometric  transformations. On the other  hand, the syntactical
models have been of  some use in describing projections of 3-D objects,
[Gipps].

	The intuitive arguments  include the opinions  that geometric
modeling is too numerical, too exact, or too non-human to be relevant
for computer vision research. Against such intuitions, I wish to pose
two fallacies. First, there is the natural mimicry fallacy, which is
that it  is false to insist that a machine must mimic nature in order
to achieve  its  design  goals. Boeing  747's  are not  covered  with
feathers;  trucks do  not  have legs;  and computer  vision  need not
simulate human vision.   The advocates of  the uniqueness of  natural
intelligence and perception will have to come up with a rather
unusual  uniqueness  proof to  establish  their conjecture.    In the
meantime, one  should be  open  minded about  the potential  forms  a
perceptive consciousness can take.

	Second,  there is  the self  introspection fallacy,  which is
that  it is false  to insist  that one's introspections  about how he
thinks and  sees are  direct observations  of thought  and sight.  By
introspection  some conclude that  the visual  models (even on  a low
level) are essentially qualitative rather than quantitative.  My belief
is that the vision processing of the  brain is quite quantitative and
only passes into qualities at a higher level of processing. In either
case, the exact details  of human visual processing are  inaccessible
to conscious self inspection.

	Although describing the above two fallacies might soften a
person's  prejudice  against  numerical  geometric  modeling,    some
important argument  or idea  is missing  that would  convince the  so
prejudiced of  the importance of  numerical models prior  to the full
achievement of computer  vision (vice  versa,   I have  not heard  an
argument that  would change my  prejudice in  favor of such  models).
This matter of conflicting intuitions would not be important, were
it not that "they" include so many of my immediate colleagues. (Of
course,  I  may well be proved wrong if  really powerful 3-D computer
vision  systems are  ever  built without  using any  geometric models
worth speaking of,  perhaps employing an  elaborate stimulus response
paradigm).

⊂6.4	Mobile Robot Vision.⊃

	The elements  discussed so far  will now be  brought together
into a system design for performing mobile robot vision. The proposed
system is illustrated below in the block diagram in box 6.7. (The
diagram is called a mandala in that
a <mandala> is any circle-like system diagram). Although the robot
chauffeured cart was the main task theme for this research, I have
failed to date (August 1974) to achieve the hardware and software
required to drive the cart around the laboratory under its own
control. Nevertheless, this necessarily theoretical cart system has
been of considerable use in developing the visual 3-D modeling
routines and theory, which are the subject of this thesis.
{|;JV;FA}
BOX 6.7{JC} CART VISION MANDALA.
{W300;λ4;F2}
 →→→→→→→→→→→→→→→→→→→ PERCEIVED →→→→→→ REALITY →→→→→→ PREDICTED →→→→
 ↑	               WORLD         SIMULATOR         WORLD      ↓
 ↑  								  ↓
 ↑								  ↓
 ↑                   PERCEIVED →→→→→→  CART →→→→→→→→ PREDICTED →→→↓
 ↑	            CAMERA LOCUS      DRIVER        CAMERA LOCUS  ↓
 ↑	                ↑		↓		   	  ↓
 ↑	                ↑		↓		   	  ↓
 ↑                      ↑	      THE CART	     PREDICTED→→→→↓
BODY                 CAMERA			     SUN LOCUS 	  ↓
LOCUS		     LOCUS				 	  ↓
SOLVER		     SOLVER				          ↓
 ↑			↑				          ↓
 ↑			↑			 	          ↓
REVEAL 	             VERIFY				       IMAGE  
COMPARE		     COMPARE				 SYNTHESIZER
 ↑   ↑	 	      ↑   ↑				          ↓
 ↑   ↑                ↑   ↑ 				          ↓
 ↑   ←←	PERCEIVED→→→→→↑   ↑←←←←←←←←←←←←←←←←←←←←	PREDICTED  ←←←←←←←↓
 ←←←←← MOSAIC IMAGE			      MOSAIC IMAGE        ↓
	   ↑					   ↑	          ↓
	   ↑					   ↑	          ↓
	   ↑					   ↑              ↓
	PERCEIVED			        PREDICTED         ↓
      CONTOUR IMAGE			      CONTOUR IMAGE       ↓
	   ↑					   ↑ 	          ↓
	   ↑					   ↑	          ↓
	   ↑					   ↑	          ↓
	PERCEIVED				PREDICTED ←←←←←←←←←
       VIDEO IMAGE			       VIDEO IMAGE
	   ↑
	   ↑
	   ↑
       TELEVISION
	 CAMERA

{|;λ30;JUFA}
	The robot chauffeur task involves establishing the
correspondence between an internal road map and the appearance of the
road in order to steer a vehicle along a predefined path. For a first
cut, the planned route  is assumed to be clear, and  the cart and the
sun  are assumed  to be the  only movable  things in  a static world.
Dealing with moving obstacles is a second problem; motion thru a
static world must be dealt with first.

	The cart  at the Stanford  Artificial Intelligence Laboratory
is intended for outdoors use and consists of a piece of plywood, four
bicycle wheels, six electric motors, two car batteries,  a television
camera,   a television transmitter, a box of  digital logic, a box of
relays,   and a  toy airplane  radio receiver.    (The vehicle  being
discussed is  not "Shakey",   which belongs  to the  Stanford Reseach
Institute's  Artificial Intelligence Group.  There  are two A.I. labs
near Stanford and  each has a  computer controlled vehicle). The  six
possible cart actions are: run forwards,  run backwards, steer to the
left,  steer to the right, pan camera to the left,  pan camera to the
right.   Other than  the television  camera,   there is no  telemetry
concerning the state of the cart or its immediate environment.
{|;λ10;JAFA}
BOX 6.8 {JC} A POSSIBLE CART TASK SOLUTION.
	 	1. Predict (or retrieve) 2D image features.
		2. Perceive (take) a television picture and convert.
		3. Compare (verify)  predicted and perceived features.
		4. Solve for camera locus.
		5. Servo the cart along its intended course.
{|;λ30;JUFA}
	The solution to the cart problem begins with the cart at a
known  starting position  with a  road map  of visual  landmarks with
known loci. That is,  the upper leftmost  two rectangles of the  cart
mandala are initialized so that the perceived cart locus and the
perceived world correspond with reality.  Flowing across the top of
the mandala, the cart driver blindly moves the cart forward along
the desired route by dead reckoning (say the cart moves five feet and
stops) and the driver updates the predicted cart locus.  The  reality
simulator is  an identity in  this simple case  because the  world is
assumed static.  Next the image synthesizer uses the predicted world,
camera and sun to compute a predicted image containing the landmark
features  expected to  be in  view.  Now, in  the lower  left of  the
mandala,  the cart's television camera takes  a perceived picture and
(flowing upwards) the picture  is converted into a form  suitable for
comparing and  matching with the  predicted image. Features  that are
both predicted  and perceived  and found  to match  are used  by  the
camera locus  solver to compute  a new  perceived camera locus  (from
which  the cart locus can  be deduced). Now the  cart driver compares
the perceived and  the predicted cart locus  and corrects its  course
and moves the cart again, and so on.
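
	The cycle just described can be summarized procedurally. The
sketch below uses hypothetical routine names (the cart software did
not exist in this form as of this writing) and follows the five steps
of Box 6.8.

# Sketch of the drive cycle of Box 6.8.  The "system" argument bundles the
# processors of the cart mandala (image synthesizer, image analyser, verify
# comparator, camera locus solver); all names here are illustrative.
def drive(system, world_model, cart_locus, route, step=5.0):
    while not route.finished(cart_locus):
        predicted_locus = route.dead_reckon(cart_locus, step)        # move blindly, predict locus
        predicted = system.synthesize(world_model, predicted_locus)  # 1. predict image features
        perceived = system.analyse(system.take_picture())            # 2. perceive and convert
        matches = system.verify(predicted, perceived)                # 3. compare the feature sets
        cart_locus = system.solve_camera_locus(matches, world_model) # 4. solve for camera locus
        route.correct_course(cart_locus, predicted_locus)            # 5. servo along the course
    return cart_locus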

	The remaining limb of the cart mandala is invoked in order to
turn the chauffeur into an explorer.  Perceived images are compared
thru time by the reveal compare  and new features are located by  the
body locus solver and placed into the world model.
The generality  and feasibility  of such  a cart  system
depends  almost entirely on  the representation of the  world and the
representation  of  image  features.  (The  more  general,  the  less
feasible). Although the bulk of the rest of this document develops a
polyhedral representation for the sake of photometric generality,
four simpler cart systems could be realized by using simpler models.

	A first system consists of a road map, a road model, a road
model generator, a solar ephemeris, an image predictor, an image
comparator, a camera  locus solver, and a  course servo routine.  The
roadways and nearby environs are  entered into the computer. In fact,
real roadways are constructed from a two dimensional (X,Y) alignment
map, showing the way the center of the road goes as a curve composed of
line segments and circular arcs, and a second two dimensional (S,Z)
elevation diagram, showing the height of the surface above sea level
as a function of distance along the road, as a sequence of linear
grades and vertical arcs which (not too surprisingly) are nearly cubic
splines. A second version is like the first except the road model,
road model generator, and image predictor are replaced by a library
of road images.  In this system the robot vehicle is trained by
being driven down the roads it is supposed to follow. A third system
is like  the first except that  the road map is  not initially given,
and indeed  the road  is no  longer presumed  to exist.  Part of  the
problem becomes finding a road, a road in the sense of a clear area;
this version yields the cart explorer, and if the clear area is found
quite rapidly and the world is updated quite frequently, the explorer
can be a chauffeur that can handle obstacles and moving objects. The
fourth system  is like the third, except that the world is modeled by
a  single  valued  surface  elevation  function,  rather  than  by  a
polyhedral model.
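
	The road representation of the first system can be made
concrete with a small data structure, sketched below; the field names
are illustrative and are not taken from the thesis code.  The (X,Y)
alignment is a sequence of line segments and circular arcs, and the
(S,Z) profile is a sequence of linear grades (the vertical,
nearly-cubic-spline arcs joining the grades are omitted here).

from dataclasses import dataclass
from typing import List

@dataclass
class AlignmentElement:          # one piece of the (X,Y) road centerline
    length: float                # length along the centerline
    curvature: float             # 0.0 for a line segment, 1/radius for a circular arc

@dataclass
class ProfileElement:            # one piece of the (S,Z) elevation diagram
    s_start: float               # distance along the road where this piece begins
    z_start: float               # elevation above sea level at s_start
    grade: float                 # linear slope dZ/dS

@dataclass
class RoadModel:
    alignment: List[AlignmentElement]
    profile: List[ProfileElement]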

⊂6.5	Summary and Related Vision Work.⊃

	To recapitulate, three vision system design requirements were
postulated: reality,  generality,  and continuity. These requirements
were illustrated  by discussing  a number  of  vision related  tasks.
Next, a vision  system was described as mediating  between 2-D images
and  a world model;  with the  world model being  further broken down
into a  3-D geometric  model and a  task world  model. Between  these
entities  three  basic  vision  modes were  identified:  recognition,
verification  and  revelation  (description).  Finally,  the  general
purpose vision system was depicted as a quantitative and description
oriented feedback cycle which maintains a 3-D geometric model for the
sake of higher, qualitative, symbolic, and recognition oriented task
processors.
Approaching the vision system in greater detail, the roles of
seven (or so) essential kinds of processors were explained: the task
processor, 3-D modeling routines, reality simulator, image
analyser, image synthesizer, comparators, and locus solvers. The
processors and data types were assembled into a cart chauffeur system.

	Larry Roberts is  justly credited for doing the  seminal work
in 3-D Computer Vision; although his thesis [Roberts] appeared over ten years
ago, the subject has languished, dependent on and overshadowed by the
four areas  called: Image Processing,   Pattern Recognition, Computer
Graphics,     and  Artificial   Intelligence.  Outside  the  computer
sciences, workers in psychology, neurology and philosophy also seek a
theory of vision.

	Image  Processing  involves  the  study  and  development  of
programs that enhance,   transform and compare 2D images.  Nearly all
image processing work can eventually be applied to computer vision in
various circumstances. A good survey of this field can be found in an
article  by Rosenfeld(69).   Image  Pattern Recognition  involves two
steps: feature extraction and classification.  A comprehensive text
about this field with respect to computer vision has been written by
[Duda and Hart].  Computer Graphics is the inverse of descriptive
computer vision.  The problem of computer graphics is to synthesize
images from three dimensional models; the problem of descriptive
computer vision is to  analyze images into three dimensional  models.
An introductory  text book about this  field would be that  of [Newman
and  Sproull]. Finally, there is Artificial Intelligence, which in
my opinion is an institution sheltering a heterogeneous group of
embryonic computer  subjects; the biggest of the  present day orphans
include: robotics,    natural language,    theorem proving,    speech
analysis, vision and planning. A more  narrow and relevant definition
of artificial intelligence is that it concerns the programming of the
robot task processor which sits above the vision system. There  is no
general  reference   on  Artificial  Intelligence  that   I  wish  to
recommend.

	The related vision work of specific individuals has already
been mentioned in context.  To summarize, the present vision work is
related to the  early work of Roberts(63)  and Sutherland; to  recent
work  at  Stanford: Falk,    Feldman  and Paul(67),    Tenenbaum(72),
Agin(72), Grape(73); to the work at MIT: Guzman, Horn, Waltz,
Krakauer; to the work at the University of Utah: Warnock, Watkins;
and to work at other places: SRI and JPL. Future progress in computer
vision will proceed in  step with better  computer hardware,   better
computer  graphics  software, and  better  world  modeling  software.
Further vision work at Stanford, which is related to the present
theory, is being done by Lynn Quam and Hans Moravec. The machine assembly
task is  being pursued both  by the Artificial Intelligence  Group of
the  Stanford  Research Institute  and  by  the Hand  Eye  Project at
Stanford University.  Because the demand for doing practical vision
tasks can be satisfied with existing ad hoc methods or by not using a
visual sensor at all, I expect little or no vision progress per se
from such research, although their demonstrations should be robotic
spectaculars. Since the missing ingredient for computer vision is
the spatial modeling to which perceived images can be related, I
believe that the development of the technology for generating
commercial film and television by computer for entertainment will
make a significant contribution to computer vision.